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Abstract 

Background: Technical improvements have decreased sequencing costs and, as a result, the size and number of 
genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now 
being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide 
polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a 
number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU 
resources that are required for large-scale whole genome sequencing data analyses are too large for many core 
facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, 
a cloud-based software package that can assist in the automation of large-scale WGS data analyses. 

Results: Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced 
subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using 
cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using 
Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US 
dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to 
handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to 
log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) 
instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream 
genome-wide association studies. 

Conclusions: Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For 
human WGS data sequenced by either the lllumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used 
straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from 
http://s3.amazonaws.com/jnj_rainbow/index.html. 

Keywords: Cloud computing, Whole genome sequencing, Single nucleotide polymorphism, SNP, Next generation 
sequencing, Software 



Background 

High-throughput next-generation sequencing (NGS) 
technologies from lllumina (San Diego, CA, USA), Life 
Technology (Carlsbad, CA, USA), and Roche/454 
(Branford, CT, USA) have evolved rapidly and are re- 
shaping the scope of genomics research [1-3]. Technical 
improvements have greatly decreased sequencing costs 
and, as a result, the size and number of datasets have 
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increased dramatically. The lower costs mean that se- 
quence data are being produced more often by small to 
midsize research groups. This trend is likely to continue 
as newer generation sequencing technologies keep driv- 
ing costs down [4]. The increasing volume of data has 
enabled the rapid adoption of whole-genome sequencing 
(WGS) to enhance drug research and development, 
which has led to a significant increase in the need for 
computational methods and bioinformatics tools [5,6]. 
For example, deep sequencing (30-60x fold coverage) of 
the entire human genome on an Illuminas HiSeq 2000 
platform typically generates billions of 100-bp short 
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reads, and the corresponding FASTQ files can be as 
large as 460 gigabytes (GBs). For a WGS project 
consisting of 50 subjects, 20 terabytes of disk space are 
required to store the raw reads alone. The data storage 
and CPU resources needed pose a huge practical chal- 
lenge for data analyses in a local environment. 

Fortunately, in recent years, cloud computing has 
emerged as a viable option to quickly and easily acquire 
the computational resources for large-scale data analyses 
[7-9]. Cloud computing offers network access to compu- 
tational resources where CPU, memory, and storage are 
accessible in the form of virtual machines (VMs). Using 
these VMs eliminates the need to build or administer 
local infrastructure while addressing the challenges in- 
volved in the rapid deployment of computing environ- 
ments for bioinformatics. In addition, cloud computing 
offers machines with different hardware and software 
specifications, including large-memory machines, fast- 
CPU machines, and abundant disk space. In addition, 
users can select and configure VMs to meet their differ- 
ent computational needs. More importantly, cloud com- 
puting can provide storage and computation at a far 
lower cost (both up-front costs and on-going costs) than 
resources that are often dedicated to specific projects. 
With the massive economies of scale, cloud- computing 
providers are continually driving costs down. This has 
led to considerable enthusiasm within the bioinformatics 
community for the use of cloud computing for NGS se- 
quence analyses [10]. Several cloud-based bioinformatics 
applications and resources have been developed specific- 
ally to address the challenges of working with the very 
large volumes of data generated by NGS technology. 
Cloud-based bioinformatics software include CloudBurst 
[11], Crossbow [12,13], Myrna [14], and CloVR [15]. 

Our focus has been on identifying genetic variations 
from WGS data, mainly single nucleotide polymor- 
phisms (SNPs). To do this, all the short reads are aligned 
to a human reference genome and then SNP calls are 
made. A few open-source applications are available for 
mapping large numbers of short reads to reference se- 
quences, including Bowtie [16], SOAP [17], and BWA 
[18]. These tools were designed initially for implementa- 
tion on a single compute node or on a local workstation, 
and generally require a long running time even with 
multiple threads, making them impractical for process- 
ing a large number of samples. However, when a soft- 
ware program is executed in a compute cluster in which 
many processors work in parallel, the calculations can 
be completed in significantly less time. We used Cross- 
bow [12], which is a Hadoop-based [19] parallel pipeline 
for genomic analysis, to search for SNPs using cloud 
computing. Crossbow uses Bowtie [16] and SOAPsnp 
[20] for alignment and SNP calling, respectively. Cross- 
bow harnesses cloud computing to efficiently and 



accurately align billions of reads and call SNPs in a few 
hours. Data from an over 35 x coverage of a human gen- 
ome can be analyzed by Crossbow in three hours for 85 
US dollars using a 40-node, 320-core cluster hosted in 
the Amazon cloud. When Crossbow was applied to large 
WGS projects in which multiple subjects were se- 
quenced, various limitations were observed. The focus of 
this paper is to describe the development of Rainbow to 
address these limitations, and to demonstrate its prac- 
tical usage by analyzing the genomic sequencing data 
from a large number of subjects in the Amazon cloud. 

Implementation 

Gaps and challenges for large-scale WGS analysis using 
cloud computing 

A major challenge with WGS analysis in the cloud is the 
process of transferring large data files. The raw sequence 
data generated by large-scale WGS studies are generally 
multiple terabytes (TB) in size. It is impractical to trans- 
fer this data to the Amazon cloud via a typical network 
connection. Amazon Import is an efficient service for 
uploading large volumes of data to the Amazon S3 
(Simple Storage Service) platform [21]. Users can ship 
multiple hard drives containing their data to Amazon via 
FedEx. Amazon then copies the data directly to S3. This 
process usually takes two to three days. After the data 
are uploaded to S3, the large files still need to be trans- 
ferred between the S3 platform and the Amazons EC2 
(Elastic Cloud Compute) instance [22]; this remains a 
practical challenge that is yet to be resolved. Currently, 
Amazon does not offer a built-in command line tool to 
facilitate the high-throughput transfer of large files be- 
tween S3 and an EC2 instance. 

Crossbow uses the s3cmd command line tool to down- 
load data from S3 [23], but s3cmd cannot handle data 
files larger than 5 GB. A typical FASTQ file for WGS is 
a few hundred gigabytes, much larger than the limit for 
s3cmd. [Authors note: at the time of writing, a new 
alpha version of s3cmd was released that addresses the 5 
GB limit.] Therefore, FASTQ files have to be split into 
smaller files for Crossbow runs. When Crossbow is exe- 
cuted in a cluster, it loads sequence data in multiple files 
in parallel to multiple nodes. Thus, file splitting reduces 
the data transfer time when Crossbow is run in a cluster. 
We developed a data pre-processing Python script that 
automatically splits large FASTQ files into smaller files 
and generates the corresponding manifest files as inputs 
to Crossbow runs. 

Multiple EC2 instances can be launched in Amazon to 
split raw sequence data files in parallel. An EC2 instance 
might fail, crash, hang, or run away, which is the second 
challenge for Rainbow, namely, managing and monitor- 
ing multiple EC2 instances in the Amazon cloud. For ex- 
ample, when 100 EC2 instances are launched in the 
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cloud, it is not practical to manually monitor this num- 
ber of remote instances by logging into them one by 
one. To be useful, Rainbow should be able to monitor 
and detect some common hardware and network 
failures, and respond accordingly. To manage large-scale 
WGS data analysis in the cloud it is necessary to keep 
track of the progress and status of the applications exe- 
cution, and collect and record runtime metrics such as 
processing times, transfer times, and file sizes. To ad- 
dress this issue, we developed a data pre-processing 
Python script to log the necessary information for moni- 
toring and to store the logs in S3. The script also per- 
forms automated management to ensure no common, 
foreseeable errors occur; if they do, the script can handle 
them appropriately. Take data transfer as an example. 
Data transfer from S3 to EC2 can sometimes fail because 
of network congestion in the cloud. When this occurs, 
instead of immediately terminating the EC2 instance, the 
script waits for several minutes before re-fetching the 
data. Other similar examples of automated management 
to handle common problems encountered during devel- 
opment and testing have been built into the script. 

The third challenge for Rainbow is the aggregation of 
SNPs from multiple samples in a WGS project. The out- 
puts from SOAPsnp for the sequences from a single sub- 
ject are chromosome-based plain text files in which each 
SNP is annotated in detail. Unfortunately, the SOAPsnp 
output is not in a standard format, making it difficult for 
other genome-wide association studies (GWAS) tools 
such as Plink [24] to use the data. Identification of SNPs 
is one of the first steps in a WGS project. Other tools 
are used in downstream analyses to understand the sig- 
nificance of the identified SNPs. To address this prob- 
lem, we developed a Perl script that can aggregate all the 
SNPs from multiple samples and merge them into 
chromosome-based genotype files, thereby allowing the 
files to be used as inputs to other GWAS tools. 

The fourth and final challenge for Rainbow is the de- 
livery of data from sequence providers. No standard 
has been set for providers, so data can be delivered in a 
variety of different formats. Raw sequence data are 
usually delivered as FASTQ files by shipping multiple 
hard drives to the customers. However, sequence pro- 
viders might run their analysis pipelines automatically 
after sequencing and deliver BAM files to the cus- 
tomers. (BAM is a binary version of a SAM (Sequence 
Alignment Map) file; SAM is a tab-delimited text file 
that contains sequence alignment data.) This reduces 
the number of multiple hard drives that are required 
for the data from large sequencing projects because a 
BAM file is roughly one-third the size of the corre- 
sponding FASTQ file. In a BAM file, all the raw reads 
have been aligned, but customers may want to redo the 
mapping using a different alignment program or a 



different reference genome (for example, the hgl9 
version of the human assembly instead of the hgl8 
version). When raw data are stored in the BAM format, 
the raw sequence reads first have to be extracted using 
Picard [25], and then re-aligned with Bowtie [16] or 
another alignment tool. Rainbow can use both BAM 
and FASTQ files as input and extracts sequence reads 
from BAM files on the fly. 

Cloud as the execution environment for Rainbow 

Many CPU hours are required to align raw reads in 
large datasets to a reference genome. For a subject with 
60 x sequencing coverage, it takes about two weeks to 
map 2 billion reads using Bowtie [16] on a local Linux 
machine without parallel processing. On a single 
machine, it would take about three years to align all 
the reads in a WGS project consisting of 50 subjects. 
In principle, it is possible to set up a local high- 
performance computing cluster to meet the computa- 
tional and storage challenges of large-scale WGS data 
analysis; however, this option is not always available. 
Furthermore, this option cannot be scaled up or down 
quickly to meet the needs of different sequencing 
projects. Cloud service providers give customers on- 
demand access to a wide range of cloud infrastructure 
services irrespective of the size of the data, and charge 
only for the resources that are actually used. Cloud 
service providers offer virtually unlimited storage and 
CPU resources, and provide a computational environ- 
ment that is ideally suited to large-scale WGS data 
analyses. 

We chose to build Rainbow using the Amazon Web 
Services as the cloud provider for the following rea- 
sons: (1) S3 centralizes data storage, including inputs, 
intermediate results, and outputs from every computa- 
tional step in Rainbow, and stores the data perman- 
ently. A variety of tools are available to access the data 
stored in S3, such as S3Fox, a web browser plug-in 
[26], and s3cmd, a command line tool [23]. S3 provides 
virtually unlimited storage space for cloud computing, 
and data in S3 are stored as multiple copies. Multiple 
data copies and redundancy guarantee the safety of 
data in Amazons cloud. The objects in S3 seem to be- 
long logically to the same folder, but they are not 
necessarily on the same physical device (a hard drive 
or a file system). All objects in S3 have unique identi- 
fiers and can be fetched in parallel without input/out- 
put (I/O) congestion. This parallelism is critical when 
multiple instances or clusters are uploaded to or 
downloaded simultaneously to files in S3, and EC2 
instances or clusters requested by the user can be re- 
leased after the computational tasks are complete, and 
the user no longer needs to pay for those resources. 
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Manifest file and name convention makes the use of 
Rainbow easy 

To support the centralization of all data files and to 
manage different types of files in S3 in an automated 
fashion, we used a master manifest file and naming con- 
vention. For a large WGS project, the manifest file is the 
only file that a user needs to prepare to run Rainbow. A 
master manifest file is a plain text file used to describe 
all subjects in a WGS project. Each subject has a corre- 
sponding entry in the manifest file. The format for each 
record is: 

Testl s3://test.bucket/testl.bam s3://testresults/ 
Test2 s3://test.bucket/test2_l.fastq; s3://test. bucket/ 
test2_2.fastq s3://test results/ 

Each entry consists of three fields separated by spaces 
or tabs: (1) a unique identifier; (2) locations of the raw 
reads in S3; and (3) an output folder in S3. The raw 
reads can be in either BAM or FASTQ files. The naming 
convention that we have used, together with the unique 
identifier for each genome, control how all the inter- 
mediate and result files are named, where they are 
stored, and how they are logically organized in S3. Each 
individual step in the Rainbow workflow uses this single 
manifest file as input, thus guaranteeing that all files are 
named and stored consistently. This process ensures that 
naming conflicts or overwriting of any file associated 
with another subject is avoided. 

Merging SNP outputs for multiple subjects 

Individual SNP records generated using SOAPsnp are in 
SOAPsnp output format, namely, one SNP per line with 
several tab-separated fields per SNP. The fields include 
SNP coordinate information, subject genotype, quality 
score of subject genotype, the best base and the number 
of uniquely aligned reads corroborating the best base, 
and the second best base and the count of all aligned 
reads corroborating the second best base. For each sub- 
ject, the SNPs are organized into one gzipped result file 
per chromosome, and the SNP records are sorted based 
on their coordinates on the chromosome. We developed 
a merge-sort-based algorithm to merge the SNPs 
chromosome by chromosome. When merging SNPs 
from multiple subjects, we wanted to ensure that only 
high-quality SNPs were retained for downstream ana- 
lysis. The criteria that we used to define a quality SNP 
were: (1) the "quality score of subject genotype" attribute 
is greater than 13, giving at least 95% confidence that 
the genotype is correct. The number 13 was calculated 
using the formula - 10 * log(l - 95%); and (2) at least 
two uniquely aligned reads corroborate both the best 
and the second best bases. During the SNP merging 
process, there is no need to load all the SNPs into 



memory at once. As a result, our algorithm has a very 
small memory footprint, and can merge SNPs from a 
very large number of subjects. The merged genotypes 
are stored as plain text files with one row corresponding 
to one SNP marker, and with each column correspond- 
ing to one sample. 

Results 

Description of Rainbow 

The workflow of Rainbow is shown in Figure 1. A user 
first ships multiple hard drives to Amazon via FedEx, 
and Amazon copies the data to S3 directly. This process 
typically takes two to three days. After the BAM or 
FASTQ files have been uploaded to S3, they can be 
processed in parallel by launching multiple EC2 in- 
stances or clusters in the cloud. When the analysis is 
complete, the results can be downloaded directly or 
exported back via Amazon Export, which takes an add- 
itional two to three days. 

There are four major steps in WGS data analysis 
(Figure 2). Step 1, the data pre-processing step, auto- 
mates (1) the extraction of raw reads from BAM files; 
(2) the splitting of large FASTQ files into smaller files; 
and (3) the generation of manifest files as inputs to 
Crossbow. In step 1, some user data are passed to the 
EC2 instance through the Cloud-init mechanism [27]. 
This user data is an executable shell script that is re- 
sponsible for downloading the pre-processing Python 
code from S3. The Python script is responsible for 
software installation, system configuration, fetching data 
from S3, extracting raw reads, splitting files, and uploading 
all the results to S3. Steps 2 and 3 are performed by Cross- 
bow and are responsible for mapping reads to the reference 
sequence and for SNP calling. A Perl script is used to parse 
a master manifest file, prepare the Crossbow command 
line, and launch the Crossbow run in the cloud for each 
subject. Step 4 uses a Perl script that was developed to con- 
solidate the SNPs for all samples. 

A practical test run 

We applied Rainbow to analyze the 44 subjects listed in 
Table 1. All 44 subjects were pair-end sequenced on 
Illumina HiSeq 2000 platforms. The estimated insert size 
was approximately 300 bp. Each subject generated 1.1-2 
billion 100-bp short reads. The largest BAM file was 190 
GB, and the corresponding FASTQ files were 460 GB (2 
x 230 GB). Sequencing coverage ranged from 30 to 60 x. 
All the raw data were in BAM format and delivered to 
us on four 2 TB hard drives. We shipped the hard drives 
to Amazon s Import Service and loaded the data files to 
S3. Then, we launched 44 ml large instances in the 
Amazon cloud to pre-process the raw data in parallel 
and to upload all split files to S3. After data pre- 
processing, 44 clusters were launched in parallel in the 
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Figure 1 The Rainbow pipeline. S3 centralizes data storage, including inputs, intermediate results, and outputs. A typical scenario to run 
Rainbow to analyze large-scale WGS data is import — ► execute — ► export. Alignment and SNP call are performed by Crossbow in a cluster with 
multiple nodes. AWS, Amazon Web Service. 



cloud to align the reads and make SNP calls. Finally, the 
SNPs from all 44 subjects were merged and a master 
chromosome-based genotype file was generated for fur- 
ther analysis. 

The total cost of shipping the four 2 TB hard drives 
was 150 US dollars per 2 TB hard drive including (a) 30 
US dollars for FedEx shipping, (b) a flat 80 US dollars 
charge per device, and (c) 42 US dollars for the data 
loading fee. The additional charge for data loading was 
2.49 US dollars per data-loading-hour, and the actual 
cost is dependent upon the I/O speed of the storage de- 
vice and the data size. It took one business day to ship 
the hard drives to Amazon by FedEx, and Amazon 
started to upload the data within 24 h of receiving the 
hard drives. 

The running environments were as follows. For step 
#1, we chose the Amazon ml. large instance, which has 
two CPUs, 7.5 GB memory, and two 420 GB instance 



drives. For each instance, an extra 220 GB EBS (Elastic 
Block Storage) [28] volume was requested to store the 
raw BAM file. When running Picard [25], 6 GB memory 
was allocated to the Java Virtual Machine. Picard read 
the BAM file from EBS, and output pair-ended se- 
quences into two FASTQ files (Fastq_l and Fastq_2) 
stored in the two instance drives. When splitting files, 
Fastq_l and Fastq_2 were processed in parallel because 
an ml. large instance has two CPUs, and Fastq_l and 
Fastq_2 were logically in different instance drives. All re- 
quested resources (CPUs/storages) of the ml. large were 
used fully, which made it the optimal choice. For steps 2 
and 3, each compute cluster had 40 cl.xlarge nodes as 
recommended by the Crossbow developers. Each cl. 
xlarge node has eight CPUs, 7 GB memory, and 1690 
GB instance storage. 

The performance of Rainbow is summarized in Figures 3, 
4, 5, 6. The relationship between the download time and 
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Figure 2 The four main steps in WGS data analysis using Rainbow. Steps 1, 2, and 3 can be executed in parallel in the cloud by launching 
multiple instances or clusters. Crossbow performs both the alignment and the SNP call. 



Table 1 Description of the WGS data for 44 subjects and a summary of the detected SNPs 



Sample 


BAM File (GB) 


FASTQ (GB) 
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4344555 


SG1235 


89 
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2370792 


1465 


3851365 
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96 
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1382 


1490258 


2752238 
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4245296 
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115 


346 


1552 
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2715088 


2983 
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Ln 


SG1238 
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322 


1432 


1504343 


2327337 


1457 


3833137 




SG1239 


93 
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1463430 


2517434 
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91 
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89 
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97 
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93 
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1356 
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2467669 
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3945951 




SG1244 


95 


269 
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1477605 


2462121 


1613 
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SG1245 


98 
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97 
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4088782 
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4034677 
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Table 1 Description of the WGS data for 44 subjects and a summary of the detected SNPs (Continued) 



SG1263 
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349 


1552 


1470360 


2653711 


2075 


4126146 


SG1264 


122 


376 


1690 


1486174 


2631507 


2034 


4119715 


SG1265 


101 


307 


1378 


1489230 


2495349 


1800 


3986379 


SG1266 


118 


352 


1576 


1479540 


2552931 


2014 


4034485 


SG1267 


99 


310 


1382 


1486385 


2490343 


1874 


3978602 


SG1268 


105 


334 


1488 


1473782 


3225627 


2422 


4701831 


SG1269 


114 


358 


1608 


1512683 


2477654 


2043 


3992380 


SG1270 


108 


298 


1340 


1557762 


2692982 


2656 


4253400 


SG1271 


127 


388 


1742 


1516008 


2596244 


2149 


4114401 


SG1272 


87 


256 


1134 


1406043 


2820949 


1743 


4228735 


SG1273 


183 


461 


1950 


1488525 


2393320 


2188 


3884033 



a Homo_SNPs are SNPs where both alleles are the same but different from the reference. 
b Hetero_SNPs are SNPs where one allele is the same as the reference and the other allele is different. 
c Het2_SNPs are SNPs where both alleles are different from the reference, and different from each other. 
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BAM file size is shown in Figure 3. Outliers of the trend 
line reflect fluctuations of network traffic. Downloading 
was performed using boto [29], an open source Python 
tool for easy connection to the Amazon Web Service. On 
average, a data transfer speed of 1.3 GB per min was 
achieved. Picard [25] comprises Java-based command-line 
utilities that manipulate BAM and SAM files. One of the 
utilities was used to extract raw sequence reads from 
BAM/SAM files. The more reads in a BAM file, the longer 
it takes to extract them, and the linear relationship be- 
tween Picard processing time and the number of reads 
represents this (see Figure 4). It usually takes one to two 
days for Picard to complete a run on an ml. large node; it 
would not have been faster if another more powerful EC2 
instance had been chosen. In addition, there is no way to 
run Picard in parallel to process a BAM file. The output of 
a Picard run is two large FASTQ files. It takes another 
eight to 16 h for FASTQ file splitting, compression, and 
uploading to S3. Step 1 will of course take much less time 
if the raw reads are in FASTQ files rather than in a BAM 
file. 

Steps 2 and 3 are much more time consuming than 
step 1, but they can be completed in a much shorter 
time in a cluster with multiple nodes (see Figures 5 and 
6). In a 320-CPU cluster, the alignment of billions of 
reads takes between 0.8 and 1.6 h, whereas, on a local 
resource, it could take up to 12 days. The linear 
relationship shown in Figure 5 is accurate because the 
sequence data blocks in the HDFS (Hadoop Distributed 
Filesystem) [30] were physically local to the nodes that 
processed them, which reduces virtual I/O delays. Cross- 
bow runs rarely failed because the Hadoop-based cluster 
was built to run on commodity hardware, and Hadoop 
has built-in mechanisms for failover and disaster recov- 
ery. A Hadoop-based cluster not only reduces the run- 
ning time by processing the data in parallel, but also 
significantly improves the robustness of applications. 
The SOAPsnp running time (see Figure 6) ranged from 
1 to 1.8 h, which overall, was a little longer than step 2. 
The numbers of SNPs identified from the 44 subjects 



are listed in Table 1. On average, about 4 million SNPs 
were identified from a single subject. After combining 
the SNPs from all 44 subjects, roughly 15 million unique 
SNPs were obtained. It was very rare for both alleles to 
be different from the reference sequence, indeed, only a 
few thousand SNPs per subject fell into this category 
(Table 1). 

On average, it cost less than 120 US dollars to 
analyze each subject, and the total cost for analyzing 
44 subjects was around 5,800 US dollars, including 
data import. All EC2 instances and clusters are termi- 
nated immediately after the jobs on them finish, and 
the large amount of data in S3 can be deleted after data 
export to reduce continual charges. No upfront invest- 
ment in infrastructure is required and there are virtu- 
ally no additional administrative costs involved in 
using the Amazon Web Service. More important than 
the cost is the ability to scale Rainbow up or down so 
that the analyses are accomplished in a short time. 
Currently, we are working on a large whole-genome se- 
quencing project in which 438 subjects are sequenced. 
Rainbow will be able to process the data from these 
438 subjects in less than two weeks, including the 
physical data import and export with Amazon. 
Compared with the 30 years or so it would take to 
process this number of subjects sequentially on a local 
machine, the time savings of parallel processing 
enabled through Rainbow are obvious. 

Discussion 

Rainbow was built on Crossbow, but the complexity of 
the Crossbow command line is hidden, which facili- 
tates its use for large-scale WGS analysis in the cloud. 
Rainbow has been well tested, and has proven to be ro- 
bust and scalable. The implementation is open-source 
based, and is available for third-party deployment and 
use. Illumina HiSeq 2000 and 2500 are currently the 
dominant sequencing platforms, and accordingly, the 
default parameters in Rainbow have been optimized 
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Figure 3 Download time from S3 to EC2 instance in the cloud versus the BAM file size. 
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Picard Processing Time vs. #Reads 
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Figure 4 Running time of Picard versus the number of sequence reads. 



and finely tuned for data generated by these platforms. 
User can tailor the parameters for other platforms, 
but this is rarely needed in practice. For human WGS 
data sequenced on an Illumina HiSeq 2000 or HiSeq 
2500 platform, Rainbow can be used straight out of the 
box. 

The low cost of whole genome sequencing has led to 
the rapid adoption of WGS for drug research and 
development. To understand the relationship between 
SNPs and disease better, and to obtain insights into the 
relation between SNPs and drug response, large-scale 
sequencing projects are continuously being initiated in 
academic institutes and drug companies [31,32], These 
sequencing projects need high-performance comput- 
ing capabilities for WGS data analyses. As we have 
shown, cloud computing drives down infrastructure 
costs both up-front and on an on-going basis, and 
offers operational advantages, such as setting up 
infrastructure in minutes rather than months, complet- 
ing massive computational projects with a large 
number of resources quickly, and scaling the architec- 
ture up and down to provide the required computa- 
tional resources. 

Analyzing large datasets in the cloud is different 
from performing the same analysis in a local 



environment. By implementing and running Rainbow 
in the cloud, we have learned many valuable lessons. 
Here we summarize what we have learned while devel- 
oping and testing Rainbow. 

• Boot time should be taken into account when new 
resources are starting up. It is good practice to give 
cloud providers 10-15 min before attempting to use 
a newly requested resource. 

• It is not trivial to move large datasets around in 
the cloud. Users should be prepared to handle 
network congestion or failures. When data 
transfers fail, it is advisable to wait for 5-10 min 
before retrying. 

• Cloud providers typically offer a variety of 
compute instances to choose from. It is necessary 
to understand the bottleneck (CPU, I/O, or 
network) for the algorithm that is to be run, and 
choose the best option accordingly. 

• When large amounts of data are moved between 
cloud resources, it is essential to ensure that they 
are in the same region or data center. 

• It is difficult to debug workflows in the cloud 
without heavy logging. 
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Figure 5 Running time of Bowtie alignment versus the number of paired sequence reads. The cluster consisted of 40 cLxlarge instances. 
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Figure 6 Running time of SOAPsnp versus the number of paired mapped reads. The cluster consisted of 40 cLxlarge instances. 



Conclusions 

We have described the motivation and implementation 
of Rainbow for large-scale WGS data analyses in the 
cloud. Rainbow has the capacity to process more than 
500 subjects in two weeks using the Amazon Web 
Service, including physical data import and export with 
Amazon. The average cost to process a single sample in 
the cloud was less than 120 US dollars. In essence, 
Rainbow is a wrapper of Crossbow, which can handle 
additional challenges in large-scale WGS data analyses. 
Compared with Crossbow, the main improvements of 
Rainbow include the ability (1) to handle BAM as well 
as FASTQ files as inputs, (2) to split large sequence files 
for better load balance downstream, (3) to log the run- 
ning metrics in data processing and monitoring multiple 
EC2 instances, and (4) to merge SOAPsnp outputs for 
multiple subjects into a single file to facilitate down- 
stream GWAS studies. Rainbow is scalable and easy to 
use. 

Availability and requirements 

Rainbow is available for third-party use. Because it was 
built for use with Amazon Web Services, users need to 
first set up an Amazon account before launching Rain- 
bow from a Linux machine. The source code for Rain- 
bow is freely available for download; however, users 
need to pay Amazon to run analyses in the Amazon 
cloud. Detailed instructions on using Rainbow are avail- 
able on the Rainbow website (http://s3.amazonaws.com/ 
jnj_rainbow/index.html) [33]. 
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